System and process for time delay estimation in the presence of correlated noise and reverberation

ABSTRACT

A system and process for estimating the time delay of arrival (TDOA) between a pair of audio sensors of a microphone array is presented. A generalized cross-correlation (GCC) technique is employed, improved with provisions for reducing the influence of both correlated ambient noise and reverberation noise in the sensor signals prior to computing the TDOA estimate. Two unique correlated ambient noise reduction procedures are proposed. One involves the application of Wiener filtering, and the other a combination of Wiener filtering with a G_(nn) subtraction technique. In addition, two unique reverberation noise reduction procedures are proposed. Both involve applying a weighting factor to the signals prior to computing the TDOA which combines the effects of a traditional maximum likelihood (TML) weighting function and a phase transformation (PHAT) weighting function.

BACKGROUND

[0001] 1. Technical Field

[0002] The invention is related to estimating the time delay of arrival (TDOA) between a pair of audio sensors of a microphone array, and more particularly to a system and process for estimating the TDOA using a generalized cross-correlation (GCC) technique that employs provisions making it more robust to correlated ambient noise and reverberation noise.

[0003] 2. Background Art

[0004] Using microphone arrays to locate a sound source has been an active research topic since the early 1990s [2]. It has many important applications including video conferencing [1, 5, 10], video surveillance, and speech recognition [8]. In general, there are three categories of techniques for sound source localization (SSL), i.e., steered-beamformer based, high-resolution spectral estimation based, and time delay of arrival (TDOA) based [2].

[0005] The steered-beamformer-based technique steers the array to various locations and searches for a peak in output power. This technique can be traced back to the early 1970s. The two major shortcomings of this technique are that it can easily become stuck in a local maximum and it exhibits a high computational cost. The high-resolution spectral-estimation-based technique representing the second category uses a spatial-spectral correlation matrix derived from the signals received at the microphone array sensors. Specifically, it is designed for far-field plane waves projecting onto a linear array. In addition, it is more suited for narrowband signals, because while it can be extended to wideband signals such as human speech, the amount of computation required increases significantly. The third category involving the aforementioned TDOA-based SSL technique is somewhat different from the first two since the measure in question is not the acoustic data received by the microphone array sensors, but rather the time delays between each sensor. So far, the most studied and widely used technique is the TDOA-based approach. Various TDOA algorithms have been developed at Brown University [2], PictureTel Corporation [10], Rutgers University [6], University of Maryland [12], USC [3], UCSD [4], and UIUC [8]. This is by no means a complete list. Instead, it is used to illustrate how much effort researchers have put into this problem.

[0006] While researchers are making good progress on various aspects of TDOA, there is still no good solution for real-life environments where two destructive noise sources exist, namely spatially correlated noise (e.g., computer fans) and room reverberation. With a few exceptions, most of the existing algorithms either assume uncorrelated noise or ignore room reverberation. It has been found that testing on data with uncorrelated noise and no reverberation will almost always give perfect results, but such algorithms will not work well in real-world situations. Thus, there needs to be a more vigorous exploration of the various noise removal techniques to handle the spatially correlated noise issue for real-world situations, along with different weighting functions to deal with the room reverberation issue. This is the focus of the present invention. It is noted, however, that the present invention is directed at providing more accurate “single-frame” estimates. Multiple-frame techniques, e.g., temporal filtering [11], are outside the scope of this invention, but can always be used to further improve the “single-frame” results. On the other hand, better single-frame estimates should also improve algorithms based on multiple frames.

[0007] It is further noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. A listing of references including the publications corresponding to each designator can be found at the end of the Detailed Description section.

SUMMARY

[0008] The present invention is directed toward a system and process for estimating the time delay of arrival (TDOA) between a pair of audio sensors of a microphone array using a generalized cross-correlation (GCC) technique that employs provisions making it more robust to correlated ambient noise and reverberation noise.

[0009] In the part of the present TDOA estimation system and process involved with reducing the influence of correlated ambient noise, one version applies Wiener filtering to the audio sensor signals. This generally entails multiplying the Fourier transform of the cross correlation of the sensor signals by a first factor representing the percentage of the non-noise portion of the overall signal from the first sensor and a second factor representing the percentage of the non-noise portion of the overall signal from the second sensor. The first factor is computed by initially subtracting the overall noise power spectrum of the signal output by the first sensor, as estimated when there is no speech in the sensor signal, from the energy of the sensor signal output by the first sensor. This difference is then divided by the energy of the first sensor's signal to produce the first factor. The second factor is computed in the same way. Namely, the overall noise power spectrum of the signal output by the second sensor is subtracted from the energy of the sensor signal output by the second sensor, and then the difference is divided by the energy of that signal.

[0010] An alternate version of the present correlated ambient noise reduction procedure applies a combined Wiener filtering and G_(nn) subtraction technique to the audio sensor signals. More particularly, the Fourier transform of the cross correlation of the overall noise portion of the sensor signals, as estimated when no speech is present in the signals, is subtracted from the Fourier transform of the cross correlation of the sensor signals. Then, the difference is multiplied by the aforementioned first and second Wiener filtering factors to further reduce the correlated ambient noise in the signals.

[0011] In the part of the present TDOA estimation system and process involved with reducing reverberation noise in the sensor signals, a first version applies a weighting factor that is in essence a combination of a traditional maximum likelihood (TML) weighting function and a phase transformation (PHAT) weighting function. This combined weighting function W_(MLR)(ω) is defined as $W_{MLR}(\omega) = \frac{|X_{1}(\omega)||X_{2}(\omega)|}{2q|X_{1}(\omega)|^{2}|X_{2}(\omega)|^{2} + (1 - q)\left(|N_{2}(\omega)|^{2}|X_{1}(\omega)|^{2} + |N_{1}(\omega)|^{2}|X_{2}(\omega)|^{2}\right)}$

[0012] where X₁(ω) is the fast Fourier transform (FFT) of the signal from a first of the pair of audio sensors, X₂(ω) is the FFT of the signal from the second of the pair of audio sensors, |N₁(ω)|² is the noise power spectrum associated with the signal from the first sensor, |N₂(ω)|² is the noise power spectrum associated with the signal from the second sensor, and q is a proportion factor.

[0013] The proportion factor q ranges between 0 and 1.0, and can be pre-selected to reflect the anticipated proportion of the correlated ambient noise to the reverberation noise. Alternately, proportion factor q can be set to the estimated ratio between the energy of the reverberation and total signal (direct path plus reverberation) at the microphones.

[0014] In another version of the process involved with reducing the influence (including interference) from reverberation noise in the sensor signals, a weighting factor is applied that switches between the traditional maximum likelihood (TML) weighting function and the phase transformation (PHAT) weighting function. More particularly, whenever the signal-to-noise ratio (SNR) of the sensor signals exceeds a prescribed SNR threshold, the PHAT weighting function is employed, and whenever the SNR of the signals is less than or equal to the prescribed SNR threshold, the TML weighting function is employed. In tested embodiments of the present system and process, the prescribed SNR threshold was set to about 15 dB.

[0015] It is noted that the foregoing procedures are typically performed on a block-by-block basis where small blocks of audio data are simultaneously sampled from the sensor signals to produce a sequence of consecutive blocks of the signal data from each signal. Each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signal sampled at the same time. The procedures are then performed on each contemporaneous pair of blocks of signal data.
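The block-by-block pairing described above can be sketched as follows. This is a minimal illustration only: the function name `paired_blocks`, the use of non-overlapping blocks, and the choice to drop a trailing partial block are assumptions made for the example and are not stated in the specification.

```python
def paired_blocks(sig1, sig2, block_len):
    """Split two simultaneously sampled signals into contemporaneous block pairs.

    Assumption for this sketch: blocks do not overlap, and any trailing
    samples that do not fill a whole block are discarded.
    """
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    n_blocks = min(len(sig1), len(sig2)) // block_len
    return [
        (sig1[i * block_len:(i + 1) * block_len],
         sig2[i * block_len:(i + 1) * block_len])
        for i in range(n_blocks)
    ]
```

Each returned pair would then be passed to the noise reduction and TDOA estimation steps independently.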

[0016] In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.

DESCRIPTION OF THE DRAWINGS

[0017] The specific features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:

[0018] FIG. 1 is a diagram depicting a general purpose computing device constituting an exemplary system for implementing the present invention.

[0019] FIG. 2 is a flow chart diagramming an overall process for estimating the TDOA between a pair of audio sensors of a microphone array according to the present invention.

[0020] FIG. 3 depicts a graph plotting the variation in the estimated angle associated with the direction of a sound source as derived using a TDOA computed with various correlated noise removal methods including No Removal (NR), G_(nn) Subtraction (GS), Wiener Filtering (WF), and both WF and GS (WG), which are represented by the vertical bars grouped in four actual angle categories (i.e., 10, 30, 50 and 70 degrees), where the vertical axis shows the error in degrees. The center of each bar represents the average estimated angle over the 500 frames and the height of each bar represents 2× the standard deviation of the 500 estimates.

[0021] FIG. 4 depicts a graph plotting the variation in the estimated angle associated with the direction of a sound source as derived using a TDOA computed with various reverberation noise removal methods including W_(PHAT)(ω), W_(TML)(ω), W_(MLR)(ω) with q=0.3, and W_(SWITCH)(ω), which are represented by the vertical bars grouped in four actual angle categories (i.e., 10, 30, 50 and 70 degrees), where the vertical axis shows the error in degrees. The center of each bar represents the average estimated angle over the 500 frames and the height of each bar represents 2× the standard deviation of the 500 estimates.

[0022] FIG. 5 depicts a graph plotting the variation in the estimated angle associated with the direction of a sound source as derived using a TDOA computed via various combined correlated and reverberation noise removal methods including W_(MLR)(ω)-WG, W_(SWITCH)(ω)-WG and W_(AMLR)(ω)-GS, which are represented by the vertical bars grouped in four actual angle categories (i.e., 10, 30, 50 and 70 degrees), where the vertical axis shows the error in degrees. The center of each bar represents the average estimated angle over the 500 frames and the height of each bar represents 2× the standard deviation of the 500 estimates.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0023] In the following description of the preferred embodiments of the present invention, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

1.0 The Computing Environment

[0024] Before providing a description of the preferred embodiments of the present invention, a brief, general description of a suitable computing environment in which the invention may be implemented will be described. FIG. 1 illustrates an example of a suitable computing system environment 100. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

[0025] The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

[0026] The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

[0027] With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

[0028] Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

[0029] The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

[0030] The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

[0031] The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195. Of particular significance to the present invention, a microphone array 192, and/or a number of individual microphones (not shown) are included as input devices to the personal computer 110. The signals from the microphone array 192 (and/or individual microphones if any) are input into the computer 110 via an appropriate audio interface 194. This interface 194 is connected to the system bus 121, thereby allowing the signals to be routed to and stored in the RAM 132, or one of the other data storage devices associated with the computer 110.

[0032] The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

[0033] When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

[0034] The exemplary operating environment having now been discussed, the remaining part of this description section will be devoted to a description of the program modules embodying the invention. Generally, the system and process according to the present invention involves estimating the time delay of arrival (TDOA) between a pair of audio sensors of a microphone array. In general, this is accomplished via the following process actions, as shown in the high-level flow diagram of FIG. 2:

[0035] a) inputting signals generated by the audio sensors (process action 200); and,

[0036] b) estimating the TDOA using a generalized cross-correlation (GCC) technique that employs both a provision for reducing correlated ambient noise, and a weighting factor for reducing reverberation noise (process action 202).

2.0 TDOA Framework

[0037] The general framework for TDOA is to choose the highest peak from the cross correlation curve of two microphones. Let s(n) be the source signal, and x₁(n) and x₂(n) be the signals received by the two microphones, then: $\begin{aligned} x_{1}(n) &= s_{1}(n) + h_{1}(n)*s(n) + n_{1}(n) = a_{1}s(n - D) + h_{1}(n)*s(n) + n_{1}(n) \\ x_{2}(n) &= s_{2}(n) + h_{2}(n)*s(n) + n_{2}(n) = a_{2}s(n) + h_{2}(n)*s(n) + n_{2}(n) \end{aligned} \qquad (1)$

[0038] where D is the TDOA, a₁ and a₂ are signal attenuations, n₁(n) and n₂(n) are the additive noise, and h₁(n)*s(n) and h₂(n)*s(n) represent the reverberation. If one can recover the cross correlation between s₁(n) and s₂(n), i.e., $\hat{R}_{s_{1}s_{2}}(\tau)$, or equivalently its Fourier transform $\hat{G}_{s_{1}s_{2}}(\omega)$, then D can be estimated. In the most simplified case [3, 8], the following assumptions are made:

[0039] 1. signal and noise are uncorrelated;

[0040] 2. noises at the two microphones are uncorrelated; and

[0041] 3. there is no reverberation.

[0042] With the above assumptions, $\hat{G}_{s_{1}s_{2}}(\omega)$ can be approximated by $\hat{G}_{x_{1}x_{2}}(\omega)$, and D can be estimated as follows: $D = \underset{\tau}{\arg\max}\; \hat{R}_{s_{1}s_{2}}(\tau), \qquad \hat{R}_{s_{1}s_{2}}(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{G}_{s_{1}s_{2}}(\omega)\, e^{j\omega\tau}\, d\omega \approx \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{G}_{x_{1}x_{2}}(\omega)\, e^{j\omega\tau}\, d\omega \qquad (2)$

[0043] While the first assumption is valid most of the time, the other two are not. Estimating D based on Eq. (2) therefore can easily break down in real-world situations. To deal with this issue, various frequency weighting functions have been proposed, and the resulting framework is called generalized cross correlation, i.e.: $D = \underset{\tau}{\arg\max}\; \hat{R}_{s_{1}s_{2}}(\tau), \qquad \hat{R}_{s_{1}s_{2}}(\tau) \approx \frac{1}{2\pi}\int_{-\pi}^{\pi} W(\omega)\, \hat{G}_{x_{1}x_{2}}(\omega)\, e^{j\omega\tau}\, d\omega \qquad (3)$

[0044] where W(ω) is the frequency weighting function.
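A discrete-time version of Eq. (3) can be sketched as follows. This is an illustrative sketch rather than the implementation used in the tested embodiments: the names `dft` and `gcc_delay` are hypothetical, a naive O(N²) DFT stands in for an FFT so that the example depends only on the Python standard library, the search is restricted to integer sample lags, and the cross correlation is circular over one block.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_delay(x1, x2, weight=None):
    """Return the integer lag D that maximizes the weighted cross correlation.

    A positive D means x1 is a delayed copy of x2, matching Eq. (1).
    `weight` maps the list of cross-spectrum bins to a list of real weights
    W(w); None means W(w) = 1, i.e. the unweighted estimate of Eq. (2).
    """
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    # Single-block estimate of the cross spectrum G_x1x2.
    G = [a * b.conjugate() for a, b in zip(X1, X2)]
    W = weight(G) if weight is not None else [1.0] * n

    def R(tau):
        # Inverse DFT of W*G evaluated at lag tau (circular cross correlation).
        acc = sum(W[k] * G[k] * cmath.exp(2j * cmath.pi * k * tau / n)
                  for k in range(n))
        return acc.real / n

    return max(range(-(n // 2), n - n // 2), key=R)
```

For example, if `x1` is `x2` circularly delayed by two samples, `gcc_delay(x1, x2)` returns 2. In practice an FFT and sub-sample peak interpolation would normally replace the naive loops.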

[0045] In practice, choosing the right weighting function is of great significance. Early research on weighting functions can be traced back to the 1970s [6]. As can be seen from Eq. (1), there are two types of noise in the system, i.e., the ambient noise n₁(n) and n₂(n) and reverberation h₁(n)*s(n) and h₂(n)*s(n). Previous research [2, 6] suggests that the traditional maximum likelihood (TML) weighting function is robust to ambient noise and the phase transformation (PHAT) weighting function is better at dealing with reverberation: $W_{TML}(\omega) = \frac{|X_{1}(\omega)||X_{2}(\omega)|}{|N_{2}(\omega)|^{2}|X_{1}(\omega)|^{2} + |N_{1}(\omega)|^{2}|X_{2}(\omega)|^{2}} \qquad (4)$ $W_{PHAT}(\omega) = \frac{1}{|\hat{G}_{x_{1}x_{2}}(\omega)|} \qquad (5)$

[0046] where X_(i)(ω) and |N_(i)(ω)|², for i=1,2, are the Fourier transform of the signal and the noise power spectrum, respectively. It is interesting to note that while W_(TML)(ω) can be mathematically derived [6], W_(PHAT)(ω) is purely heuristic. Most of the existing work [2, 3, 6, 8, 12] uses either W_(TML)(ω) or W_(PHAT)(ω).
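Eqs. (4) and (5) can be evaluated per frequency bin as in the sketch below. The function names `w_tml` and `w_phat`, the list-based spectra, and the small constant `eps` added to each denominator to avoid division by zero are assumptions of this example and are not part of the equations.

```python
def w_tml(X1, X2, N1_pow, N2_pow, eps=1e-12):
    """Eq. (4): traditional ML weight per bin.

    X1, X2 are complex spectra; N1_pow, N2_pow are noise power spectra
    |N_i(w)|^2. eps is an assumed guard against an all-zero denominator.
    """
    weights = []
    for x1, x2, n1, n2 in zip(X1, X2, N1_pow, N2_pow):
        a1, a2 = abs(x1), abs(x2)
        weights.append(a1 * a2 / (n2 * a1 ** 2 + n1 * a2 ** 2 + eps))
    return weights

def w_phat(G, eps=1e-12):
    """Eq. (5): PHAT weight per bin, given the cross spectrum G_x1x2."""
    return [1.0 / (abs(g) + eps) for g in G]
```

Multiplying the cross spectrum by `w_phat` leaves only its phase, which is why PHAT is insensitive to the spectral shape of the source.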

3.0 A Two-stage Perspective

[0047] In this section, the TDOA estimation problem will be analyzed as a two-stage process, namely first removing the correlated noise and then attempting to minimize the reverberation effect.

3.1 Correlated Noise Removal

[0048] In offices and conference rooms, there are many noise sources, e.g., ceiling fans, computer fans and computer hard drives. These noises will be heard by both microphones. It is therefore unrealistic to assume n₁(n) and n₂(n) are uncorrelated. They are, however, stationary or short-time stationary, such that it is possible to estimate the noise spectrum over time. Three techniques will now be described for removing correlated noise. While the first one is known [10], the other two are novel to the present invention.

[0049] 3.1.1 G_(nn) Subtraction (GS)

[0050] If n₁(n) and n₂(n) are correlated, then $\hat{G}_{x_{1}x_{2}}(\omega) = \hat{G}_{s_{1}s_{2}}(\omega) + \hat{G}_{n_{1}n_{2}}(\omega)$. Therefore, a better estimate of $\hat{G}_{s_{1}s_{2}}(\omega)$ can be obtained as:

$\hat{G}_{s_{1}s_{2}}^{GS}(\omega) = \hat{G}_{x_{1}x_{2}}(\omega) - \hat{G}_{n_{1}n_{2}}(\omega) \qquad (6)$

[0051] where $\hat{G}_{n_{1}n_{2}}(\omega)$ is estimated when there is no speech.
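One way to realize Eq. (6) is sketched below. The text states only that the noise cross spectrum is estimated when there is no speech; averaging over a set of noise-only frames, and the function names `average_cross_spectrum` and `gnn_subtract`, are assumptions made for this illustration.

```python
def average_cross_spectrum(spectra1, spectra2):
    """Average X1(w) * conj(X2(w)) per bin over a list of frames.

    spectra1 and spectra2 are lists of per-frame complex spectra from the
    two sensors. Applied to noise-only frames (an assumed estimation
    scheme), the result approximates G_n1n2.
    """
    n_frames = len(spectra1)
    n_bins = len(spectra1[0])
    avg = [0j] * n_bins
    for X1, X2 in zip(spectra1, spectra2):
        for k in range(n_bins):
            avg[k] += X1[k] * X2[k].conjugate() / n_frames
    return avg

def gnn_subtract(G_x, G_n):
    """Eq. (6): subtract the noise cross spectrum from the observed one."""
    return [gx - gn for gx, gn in zip(G_x, G_n)]
```

Because the subtraction is complex-valued, it removes the phase bias that a correlated noise source would otherwise introduce into the cross spectrum.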

[0052] 3.1.2 Wiener Filtering (WF)

[0053] Wiener filtering reduces stationary noise. If each microphone's signal is passed through a Wiener filter, a lesser amount of correlated noise would be expected in $\hat{G}_{x_{1}x_{2}}(\omega)$. Thus,

$\hat{G}_{s_{1}s_{2}}^{WF}(\omega) = W_{1}(\omega)W_{2}(\omega)\hat{G}_{x_{1}x_{2}}(\omega), \qquad W_{i}(\omega) = \frac{|X_{i}(\omega)|^{2} - |N_{i}(\omega)|^{2}}{|X_{i}(\omega)|^{2}}, \quad i = 1, 2 \qquad (7)$

[0054] where |N_(i)(ω)|² is estimated when there is no speech.
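The Wiener factors of Eq. (7) can be computed per bin as in the following sketch. The names `wiener_gain` and `wiener_cross_spectrum` are hypothetical. Clamping the gain at a `floor` value when the noise estimate exceeds the observed power is an assumption added here for numerical safety; Eq. (7) itself does not specify it.

```python
def wiener_gain(X, N_pow, floor=0.0):
    """W_i(w) = (|X_i|^2 - |N_i|^2) / |X_i|^2 per bin, clamped at `floor`.

    The clamp is an assumption of this sketch: without it, bins where the
    noise estimate exceeds the observed power would get a negative gain.
    """
    gains = []
    for x, n in zip(X, N_pow):
        p = abs(x) ** 2
        gains.append(max((p - n) / p, floor) if p > 0.0 else floor)
    return gains

def wiener_cross_spectrum(X1, X2, N1_pow, N2_pow):
    """Eq. (7): Wiener-filtered cross spectrum W1 * W2 * G_x1x2."""
    W1 = wiener_gain(X1, N1_pow)
    W2 = wiener_gain(X2, N2_pow)
    return [w1 * w2 * x1 * x2.conjugate()
            for w1, w2, x1, x2 in zip(W1, W2, X1, X2)]
```

Each gain is the estimated fraction of non-noise power in that bin, so the gains are real-valued and scale the magnitude of the cross spectrum without altering its phase.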

[0055] 3.1.3. Wiener Filtering and G_(nn) Subtraction (WG)

[0056] Wiener filtering will not completely remove the stationary noise. However, the residual can further be removed by using GS. Thus, combining Wiener filtering with G_(nn) subtraction can produce even better noise reduction results. This combined correlated noise removal technique (referred to as WG herein) is defined by:

$\hat{G}_{s_{1}s_{2}}^{WG}(\omega) = W_{1}(\omega)W_{2}(\omega)\left(\hat{G}_{x_{1}x_{2}}(\omega) - \hat{G}_{n_{1}n_{2}}(\omega)\right) \qquad (8)$
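Eq. (8) combines the two preceding steps in a single per-bin expression, sketched below. The function name `wg_cross_spectrum` is hypothetical, and the `floor` clamp on the Wiener factors is an assumption of this example rather than part of Eq. (8).

```python
def wg_cross_spectrum(X1, X2, N1_pow, N2_pow, G_n, floor=0.0):
    """Eq. (8): W1 * W2 * (G_x1x2 - G_n1n2) per frequency bin.

    G_n is the noise cross spectrum estimated when no speech is present.
    Gains are clamped at `floor` (an assumption, see the lead-in).
    """
    out = []
    for x1, x2, n1, n2, gn in zip(X1, X2, N1_pow, N2_pow, G_n):
        p1, p2 = abs(x1) ** 2, abs(x2) ** 2
        w1 = max((p1 - n1) / p1, floor) if p1 > 0.0 else floor
        w2 = max((p2 - n2) / p2, floor) if p2 > 0.0 else floor
        out.append(w1 * w2 * (x1 * x2.conjugate() - gn))
    return out
```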

3.2 Alleviating Reverberation Effects

[0057] While there are existing techniques to remove correlated noise as discussed above, no effective technique is available to remove reverberation. But it is possible to alleviate the reverberation effect to a certain extent using a maximum likelihood weighting function.

[0058] Even though reverberation is thought of as correlated noise in that it affects the signal produced by both microphones, a closer examination reveals that it is not correlated in the frequency domain. When reverberation noise is viewed in the frequency domain over a frame of audio input, it is discovered that it acts independently of frequency. In other words, contrary to what may have been intuitive and the common belief in the field of noise reduction, between each frequency the delay in the reverberation signal reaching each microphone varies and the sum of these delays tends toward zero. Thus, in practical terms reverberation noise is not correlated to the source. Given this realization, it becomes clear that reverberation noise can be filtered out of the microphone signal. One embodiment of a process for filtering out reverberation will now be described.

[0059] If reverberation is considered as just another type of noise, then

$|N_{i}^{T}(\omega)|^{2} = |H_{i}(\omega)|^{2}|S(\omega)|^{2} + |N_{i}(\omega)|^{2} \qquad (9)$

[0060] where $|N_{i}^{T}(\omega)|^{2}$ represents the total noise. Further, if it is assumed that the phase of H_(i)(ω) is random and independent of S(ω) as indicated above, then E{S(ω)H_(i)(ω)S*(ω)}=0, and, from Eq. (1), the following energy equation is formed,

$|X_{i}(\omega)|^{2} = a|S(\omega)|^{2} + |H_{i}(\omega)|^{2}|S(\omega)|^{2} + |N_{i}(\omega)|^{2} \qquad (10)$

[0061] Both the reverberant signal and the direct-path signal are caused by the same source. The reverberant energy is therefore proportional to the direct-path energy, by a constant p. Thus,

$|H_{i}(\omega)|^{2}|S(\omega)|^{2} = p|S(\omega)|^{2} = \frac{p}{a + p}\left(|X_{i}(\omega)|^{2} - |N_{i}(\omega)|^{2}\right) \qquad (11)$

[0062] The total noise is therefore: $|N_{i}^{T}(\omega)|^{2} = \frac{p}{a + p}\left(|X_{i}(\omega)|^{2} - |N_{i}(\omega)|^{2}\right) + |N_{i}(\omega)|^{2} = q|X_{i}(\omega)|^{2} + (1 - q)|N_{i}(\omega)|^{2} \qquad (12)$

[0063] where q=p/(a+p). If Eq. (12) is substituted into Eq. (4), the ML weighting function for the reverberant situation is created. Namely, $W_{MLR}(\omega) = \frac{|X_{1}(\omega)||X_{2}(\omega)|}{2q|X_{1}(\omega)|^{2}|X_{2}(\omega)|^{2} + (1 - q)\left(|N_{2}(\omega)|^{2}|X_{1}(\omega)|^{2} + |N_{1}(\omega)|^{2}|X_{2}(\omega)|^{2}\right)} \qquad (13)$
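The substitution step can be written out explicitly. The derivation below is a worked expansion added for clarity. It replaces each noise power spectrum in the denominator of Eq. (4) with the corresponding total noise from Eq. (12); the argument ω is omitted for brevity.

```latex
\begin{align*}
|N_2^T|^2 |X_1|^2 + |N_1^T|^2 |X_2|^2
  &= \bigl(q|X_2|^2 + (1-q)|N_2|^2\bigr)|X_1|^2
   + \bigl(q|X_1|^2 + (1-q)|N_1|^2\bigr)|X_2|^2 \\
  &= 2q\,|X_1|^2 |X_2|^2
   + (1-q)\bigl(|N_2|^2 |X_1|^2 + |N_1|^2 |X_2|^2\bigr)
\end{align*}
```

Setting q = 0 recovers the denominator of Eq. (4), while q = 1 leaves 2|X₁|²|X₂|², so that the weight becomes 1/(2|X₁||X₂|), which is the PHAT weight up to a constant factor that does not affect the location of the peak.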

[0064] It is noted that the selection of a value for q in Eq. (13) allows the tailoring of the weight given to the reverberation noise reduction component versus the ambient (correlated) noise reduction component. Thus, with prior knowledge of the approximate mix of reverberation and ambient noise anticipated, q can be set appropriately. Alternatively, if such prior knowledge is not available, p can be computed to determine the appropriate value for q. However, in practice a precise estimation or computation of q may be hard to obtain.

[0065] In view of this, it is noted that when the ambient noise dominates, W_(MLR)(ω) reduces to the traditional ML solution without reverberation, W_(TML)(ω) (see Eq. (4)). In addition, when the reverberation noise dominates, W_(MLR)(ω) reduces to W_(PHAT)(ω) (see Eq. (5)). This agrees with previous research showing that PHAT is robust to reverberation when there is no ambient noise. These observations suggest that it is also possible to design another weighting function heuristically, one which performs almost as well as the optimum solution provided by W_(MLR)(ω). Specifically, when the signal-to-noise ratio (SNR) is high, W_(PHAT)(ω) is chosen, and when the SNR is low, W_(TML)(ω) is chosen. This weighting function will be referred to as W_(SWITCH)(ω): $\begin{matrix}{{W_{SWITCH}(\omega)} = \left\{ \begin{matrix}{{W_{PHAT}(\omega)},} & {{SNR} > {SNR}_{0}} \\{{W_{TML}(\omega)},} & {{SNR} \leq {SNR}_{0}}\end{matrix} \right.} & (14)\end{matrix}$

[0066] where SNR₀ is a predetermined threshold, e.g., about 15 dB. This alternate weighting function is advantageous because SNR is relatively easy to estimate.
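By way of example only, the switch of Eq. (14) can be sketched in Python as follows. The forms used for the PHAT weighting, 1/(|X₁||X₂|), and the TML weighting, |X₁||X₂|/(|N₂|²|X₁|²+|N₁|²|X₂|²), are assumptions, since Eqs. (4) and (5) are not reproduced here. The frame-level SNR estimate is likewise only one hypothetical way of obtaining the SNR from the frame energy and a noise power spectrum measured when no speech is present.

```python
import math


def w_phat(x1_mag, x2_mag):
    """Assumed PHAT weighting for one bin: 1 / (|X1||X2|)."""
    return 1.0 / (x1_mag * x2_mag)


def w_tml(x1_mag, x2_mag, n1_pow, n2_pow):
    """Assumed TML weighting for one bin."""
    return (x1_mag * x2_mag) / (n2_pow * x1_mag ** 2 + n1_pow * x2_mag ** 2)


def frame_snr_db(x_pow, n_pow):
    """Hypothetical frame SNR estimate from per-bin powers.

    x_pow: per-bin energies |X|^2 of the current frame.
    n_pow: per-bin noise powers |N|^2 estimated during non-speech frames.
    """
    noise = sum(n_pow)
    signal = max(sum(x_pow) - noise, 1e-12)  # floor avoids log of zero
    return 10.0 * math.log10(signal / noise)


def w_switch(x1_mag, x2_mag, n1_pow, n2_pow, snr_db, snr0_db=15.0):
    """Eq. (14): PHAT when SNR > SNR0, otherwise TML."""
    if snr_db > snr0_db:
        return w_phat(x1_mag, x2_mag)
    return w_tml(x1_mag, x2_mag, n1_pow, n2_pow)
```

The default threshold of 15 dB follows the example value given above. An SNR exactly equal to the threshold selects the TML branch, matching the "less than or equal" case of Eq. (14).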

4.0 Experimental Results

[0067] We have done experiments on all the major combinations listed in Table 1. Furthermore, for the test data, we covered a wide range of sound source angles from −80 to +80 degrees. Here we report only three sets of experiments designed to compare different techniques on the following aspects:

[0068] 1. For a uniform weighting function, which noise removal technique is the best?

[0069] 2. If we turn off the noise removal technique, which weighting function performs the best?

[0070] 3. Overall, which algorithm (e.g., a particular cell in Table 1) is the best?

4.1 Test Data Description

[0071] We take into account both correlated noise and reverberation when generating our test data. We generated a large amount of data using the imaging method of [9]. The setup corresponds to a 6 m×7 m×2.5 m room, with two microphones placed 15 cm apart, 1 m from the floor and 1 m from a 6 m wall (in relation to which they are centered). The absorption coefficient of the wall was computed to produce several reverberation times, but results are presented here only for T₆₀=50 ms. Furthermore, two noise sources were included: fan noise in the center of the room ceiling, and computer noise in the left corner opposite to the microphones, at 50 cm from the floor. The same room reverberation model was used to add reverberation to these noise signals, which were then added to the already reverberated desired signal. For more realistic results, fan noise and computer noise were actually acquired from a ceiling fan and from a computer. The desired signal is 60 seconds of normal speech, captured with a close-talking microphone.

[0072] The sound source is generated for 4 different angles: 10, 30, 50, and 70 degrees, viewed from the center of the two microphones. The 4 sources are all 3 m away from the microphone center. The SNRs are 0 dB when both ambient noise and reverberation noise are considered. The sampling frequency is 44.1 kHz, and the frame size is 1024 samples (~23 ms). We band-pass the raw signal to 800 Hz-4000 Hz. The test data for each of the 4 angles is 60 seconds long. Out of the 60 seconds of data, i.e., 2584 frames, about 500 frames are speech frames. The results reported in this section are obtained by using all of the 500 speech frames.
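The frame figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. It assumes non-overlapping frames, which is consistent with the stated count of 2584, and, for the band-pass bin indices only, an FFT length equal to the frame size. Both are assumptions rather than stated parameters, and the constant names are chosen for the example.

```python
import math

SAMPLE_RATE_HZ = 44100      # 44.1 kHz sampling frequency
FRAME_SIZE = 1024           # samples per frame
DURATION_S = 60             # length of each test recording
BAND_HZ = (800.0, 4000.0)   # pass band applied to the raw signal

# Frame duration in milliseconds (about 23 ms).
frame_ms = 1000.0 * FRAME_SIZE / SAMPLE_RATE_HZ

# Number of non-overlapping frames in 60 seconds (about 2584).
num_frames = DURATION_S * SAMPLE_RATE_HZ / FRAME_SIZE

# Frequency spacing of FFT bins, assuming an FFT length equal to the frame size.
bin_hz = SAMPLE_RATE_HZ / FRAME_SIZE
first_bin = math.ceil(BAND_HZ[0] / bin_hz)
last_bin = math.floor(BAND_HZ[1] / bin_hz)

print(f"frame duration: {frame_ms:.2f} ms")
print(f"frames in {DURATION_S} s: {num_frames:.2f}")
print(f"bins inside the pass band: {first_bin} to {last_bin}")
```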

[0073] There are 4 groups in each of FIGS. 3-5, corresponding to ground truth angles of 10, 30, 50 and 70 degrees. Within each group, there are several vertical bars representing the different techniques to be compared. The vertical axis in the figures is error in degrees. The center of each bar represents the average error of the estimated angle over the 500 frames; a center close to zero means a small estimation bias. The height of each bar represents 2× the standard deviation of the 500 estimates; short bars indicate low variance. Note also that the better results at smaller angles are expected and intrinsic to the geometry of the problem.
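As an illustration of how each bar in the figures can be read, the following Python sketch computes a bar center (mean error) and a bar height (twice the standard deviation) from a list of per-frame angle estimates. The function name is hypothetical, the sample estimates are made-up numbers used only to exercise the code, and the use of the population standard deviation is an assumption.

```python
import statistics


def bar_summary(estimates_deg, true_angle_deg):
    """Return (center, height) of one bar, both in degrees.

    center: mean estimation error, i.e. the bias.
    height: 2 x standard deviation of the errors (population form assumed).
    """
    errors = [estimate - true_angle_deg for estimate in estimates_deg]
    center = statistics.mean(errors)
    height = 2.0 * statistics.pstdev(errors)
    return center, height


# Made-up estimates for a source at 10 degrees, for illustration only.
center, height = bar_summary([9.0, 11.0, 10.0, 12.0], 10.0)
print(f"bias = {center:.3f} deg, bar height = {height:.3f} deg")
```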

4.2 Experiment 1: Correlated Noise Removal

[0074] Here, we fix the weighting function as W_(BASE)(ω) and compare the following four noise removal techniques: No Removal (NR), G_(nn) Subtraction (GS), Wiener Filtering (WF), and both WF and GS (WG). The results are summarized in FIG. 3, and the following observations can be made:

[0075] 1. All three of the correlated noise removal techniques are better than NR. They have smaller bias and smaller variance.

[0076] 2. WG is slightly better than the other two techniques. This is especially true when the source angle is small.
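By way of example and not limitation, the four conditions compared above (NR, GS, WF and WG) can be sketched for a single frequency bin in Python as follows. The Wiener factor follows the description given in the claims, namely the signal energy minus the noise power spectrum, divided by the signal energy. The cross-spectrum convention X₁(ω)X₂*(ω), the clamping of the factor at zero, and all function and argument names are assumptions made for the sketch.

```python
def wiener_factor(x_pow, n_pow):
    """Fraction of the bin energy that is not noise: (|X|^2 - |N|^2) / |X|^2.

    Clamped at zero (an assumption) so that an over-estimated noise
    spectrum cannot produce a negative factor.
    """
    if x_pow <= 0.0:
        return 0.0
    return max(x_pow - n_pow, 0.0) / x_pow


def denoised_cross_spectrum(x1, x2, n_cross, n1_pow, n2_pow,
                            use_gs=True, use_wf=True):
    """Cross spectrum of one bin after the selected noise removal steps.

    x1, x2: complex FFT values of the two sensor signals for this bin.
    n_cross: noise cross spectrum estimated when no speech is present.
    n1_pow, n2_pow: noise power spectra of the two sensors for this bin.
    use_gs / use_wf: False/False is NR, True/False is GS,
                     False/True is WF, True/True is WG.
    """
    g = x1 * x2.conjugate()          # assumed convention X1 * conj(X2)
    if use_gs:
        g -= n_cross                 # G_nn subtraction
    if use_wf:
        g *= wiener_factor(abs(x1) ** 2, n1_pow)
        g *= wiener_factor(abs(x2) ** 2, n2_pow)
    return g
```

When both steps are enabled, the subtraction is performed first and the Wiener factors are applied to the resulting difference, which is the order recited for the combined technique.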

4.3 Experiment 2: Alleviating Reverberation Effects

[0077] Here, we turn off the noise removal condition (i.e., NR in Table 1), and then compare the following 4 weighting functions: W_(PHAT)(ω), W_(TML)(ω), W_(MLR)(ω) with q=0.3, and W_(SWITCH)(ω). The results are summarized in FIG. 4, and the following observations can be made:

[0078] 1. Because the test data contains both correlated ambient noise and reverberation noise, the condition for W_(PHAT)(ω) is not satisfied. It therefore gives poor results, e.g., high bias at 10 degrees and high variance at 70 degrees.

[0079] 2. Similarly, the condition for W_(TML)(ω) is not satisfied either, and it has high bias, especially when the source angle is large.

[0080] 3. Both W_(MLR)(ω) and W_(SWITCH)(ω) perform well, as they simultaneously model ambient noise and reverberation.

4.4 Experiment 3: Overall Performance

[0081] Here, we are interested in the overall performance. We report on only the two techniques according to the present invention (i.e., W_(MLR)(ω)-WG and W_(SWITCH)(ω)-WG) and compare them against the approach of [10], one of the best currently available. The technique of [10] is W_(AMLR)(ω)-GS in our terminology (see Table 1). The results are summarized in FIG. 5. The following observations can be made:

[0082] 1. All three algorithms perform well in general; all have small bias and small variance.

[0083] 2. W_(MLR)(ω)-WG seems to be the overall winning algorithm. It is more consistent than the other two. For example, W_(SWITCH)(ω)-WG has a large bias at 70 degrees and W_(AMLR)(ω)-GS has a large variance at 50 degrees.

5.0 References

[0084] [1] S. Birchfield and D. Gillmor, Acoustic source direction by hemisphere sampling, Proc. of ICASSP, 2001.

[0085] [2] M. Brandstein and H. Silverman, A practical methodology for speech localization with microphone arrays, Technical Report, Brown University, Nov. 13, 1996.

[0086] [3] P. Georgiou, C. Kyriakakis and P. Tsakalides, Robust time delay estimation for sound source localization in noisy environments, Proc. of WASPAA, 1997.

[0087] [4] T. Gustafsson, B. Rao and M. Trivedi, Source localization in reverberant environments: performance bounds and ML estimation, Proc. of ICASSP, 2001.

[0088] [5] Y. Huang, J. Benesty, and G. Elko, Passive acoustic source location for video camera steering, Proc. of ICASSP, 2000.

[0089] [6] J. Kleban, Combined acoustic and visual processing for video conferencing systems, MS Thesis, The State University of New Jersey, Rutgers, 2000.

[0090] [7] C. Knapp and G. Carter, The generalized correlation method for estimation of time delay, IEEE Trans. on ASSP, Vol. 24, No. 4, Aug. 1976.

[0091] [8] D. Li and S. Levinson, Adaptive sound source localization by two microphones, Proc. of Int. Conf. on Robotics and Automation, Washington D.C., May 2002.

[0092] [9] P. M. Peterson, Simulating the response of multiple microphones to a single acoustic source in a reverberant room, J. Acoust. Soc. Amer., vol. 80, pp. 1527-1529, Nov. 1986.

[0093] [10] H. Wang and P. Chu, Voice source localization for automatic camera pointing system in videoconferencing, Proc. of ICASSP, 1997.

[0094] [11] D. Ward and R. Williamson, Particle filter beamforming for acoustic source localization in a reverberant environment, Proc. of ICASSP, 2002.

[0095] [12] D. Zotkin, R. Duraiswami, L. Davis, and I. Haritaoglu, An audio-video front-end for multimedia applications, Proc. SMC, Nashville, Tenn., 2000.

Wherefore, what is claimed is:
 1. A computer-implemented process for estimating the time delay of arrival (TDOA) between a pair of audio sensors of a microphone array, comprising using a computer to perform the following process actions: inputting signals generated by the audio sensors; and estimating the TDOA using a generalized cross-correlation (GCC) technique which, employs a provision for reducing the influence from correlated ambient noise, and employs a weighting factor for reducing the influence from reverberation noise.
 2. The process of claim 1, wherein the process action of employing a provision in the GCC technique for reducing the influence from correlated ambient noise, comprises an action of applying Wiener filtering to the audio sensor signals.
 3. The process of claim 2, wherein the process action of applying Wiener filtering to each of the audio sensor signals, comprises an action of multiplying the Fourier transform of the cross correlation of the sensor signals by a factor representing the percentage of the non-noise portion of the overall signal from the first sensor and a factor representing the percentage of the non-noise portion of the overall signal from the second sensor.
 4. The process of claim 3, further comprising the process actions of: computing the factor representing the percentage of the non-noise portion of the overall signal from the first sensor by subtracting the overall noise power spectrum of the signal output by a first of the sensors, as estimated when there is no speech in the sensor signal, from the energy of the sensor signal output by the first sensor, and then dividing the difference by the energy of the sensor signal output by the first sensor; and computing the factor representing the percentage of the non-noise portion of the overall signal from the second sensor by subtracting said overall noise power spectrum of the signal output by a second of the sensors from the energy of the sensor signal output by the second sensor, and then dividing the difference by the energy of the sensor signal output by the second sensor.
 5. The process of claim 1, wherein the process action of employing a provision in the GCC technique for reducing the influence from correlated ambient noise, comprises an action of applying a combined Wiener filtering and G_(nn) subtraction technique to the audio sensor signals.
 6. The process of claim 5, wherein the process action of applying a combined Wiener filtering and G_(nn) subtraction technique to the audio sensor signals, comprises an action of multiplying the difference obtained by subtracting the Fourier transform of the cross correlation of the overall noise portion of the sensor signals, as estimated when no speech is present in the signals, from the Fourier transform of the cross correlation of the sensor signals, by a factor representing the percentage of the non-noise portion of the overall signal from the first sensor and a factor representing the percentage of the non-noise portion of the overall signal from the second sensor.
 7. The process of claim 6, further comprising the process actions of: computing the factor representing the percentage of the non-noise portion of the overall signal from the first sensor by subtracting the overall noise power spectrum of the signal output by the first sensor, as estimated when there is no speech in the sensor signal, from the energy of the sensor signal output by the first sensor and then dividing the difference by the energy of the sensor signal output by the first sensor; and computing the factor representing the percentage of the non-noise portion of the overall signal from the second sensor by subtracting said overall noise power spectrum of the signal output by the second sensor from the energy of the sensor signal output by the second sensor, and then dividing the difference by the energy of the sensor signal output by the second sensor.
 8. The process of claim 1, wherein the process action of employing a weighting factor for reducing the influence from the reverberation noise, comprises an action of establishing a weighting function which is a combination of a traditional maximum likelihood (TML) weighting function and a phase transformation (PHAT) weighting function.
 9. The process of claim 8, wherein the process action of establishing a weighting function comprises an action of employing W_(MLR)(ω) as the weighting function, wherein ${W_{MLR}(\omega)} = \frac{\left| X_{1}(\omega) \right|\left| X_{2}(\omega) \right|}{2q\left| X_{1}(\omega) \right|^{2}\left| X_{2}(\omega) \right|^{2} + \left( 1 - q \right)\left( \left| N_{2}(\omega) \right|^{2}\left| X_{1}(\omega) \right|^{2} + \left| N_{1}(\omega) \right|^{2}\left| X_{2}(\omega) \right|^{2} \right)}$

where X₁(ω) is the fast Fourier transform (FFT) of the signal from a first of the pair of audio sensors, X₂(ω) is the FFT of the signal from the second of the pair of audio sensors, |N₁(ω)|² is the noise power spectrum associated with the signal from the first sensor, |N₂(ω)|² is the noise power spectrum associated with the signal from the second sensor, and q is a proportion factor.
 10. The process of claim 9, wherein the proportion factor q is set to an estimated ratio between the energy of the reverberation and total signal at the microphones.
 11. The process of claim 9, wherein the proportion factor q ranges between 0 and 1.0 and is selected to reflect the proportion of the correlated ambient noise to the reverberation noise.
 12. The process of claim 8, wherein the process action of establishing a weighting function comprises an action of establishing a switch function such that whenever the signal-to-noise ratio (SNR) of the signals exceeds a prescribed SNR threshold, the PHAT weighting function is employed, and whenever the SNR of the signals is less than or equal to the prescribed SNR threshold, the TML weighting function is employed.
 13. The process of claim 12, wherein the prescribed SNR threshold is about 15 dB.
 14. A system for reducing the influence from correlated ambient noise in audio signals prior to processing the signals, comprising: a microphone array having at least a pair of audio sensors; a general purpose computing device; a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to, input signals generated by each audio sensor of the microphone array; simultaneously sample the inputted signals to produce a sequence of consecutive blocks of the signal data from each signal, wherein each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signal sampled at the same time; for each contemporaneous pair of blocks of signal data, apply Wiener filtering to the audio sensor signals.
 15. The system of claim 14, wherein the program module for applying Wiener filtering to the audio sensor signals, comprises sub-modules for: computing a first factor representing the percentage of the non-noise portion of the overall signal from the first sensor by subtracting the overall noise power spectrum of the signal output by a first of the sensors, as estimated when there is no speech in the sensor signal, from the energy of the sensor signal output by the first sensor, and then dividing the difference by the energy of the sensor signal output by the first sensor; computing a second factor representing the percentage of the non-noise portion of the overall signal from the second sensor by subtracting said overall noise power spectrum of the signal output by a second of the sensors from the energy of the sensor signal output by the second sensor, and then dividing the difference by the energy of the sensor signal output by the second sensor; and multiplying the Fourier transform of the cross correlation of the sensor signals by the first and second factors.
 16. The system of claim 14, further comprising a program module which, for each contemporaneous pair of blocks of signal data, applies a G_(nn) subtraction correlated noise reduction technique to the audio sensor signal block pair in addition to said Wiener filtering.
 17. The system of claim 16, wherein the program module for applying the G_(nn) subtraction technique to the audio sensor signal block pair under consideration, comprises a sub-module which, prior to applying said Wiener filtering, subtracts the Fourier transform of the cross correlation of the overall noise portion of the sensor signals, as estimated when no speech is present in the signal blocks, from the Fourier transform of the cross correlation of the sensor signal blocks, wherein said Wiener filtering is applied to the resulting difference.
 18. A system for reducing the influence from reverberation noise in audio signals prior to processing the signals, comprising: a microphone array having at least a pair of audio sensors; a general purpose computing device; a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to, input signals generated by each audio sensor of the microphone array; simultaneously sample the inputted signals to produce a sequence of consecutive blocks of the signal data from each signal, wherein each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signal sampled at the same time; for each contemporaneous pair of blocks of signal data, employ a weighting factor W_(MLR)(ω) to reduce reverberation noise, wherein ${W_{MLR}(\omega)} = \frac{\left| X_{1}(\omega) \right|\left| X_{2}(\omega) \right|}{2q\left| X_{1}(\omega) \right|^{2}\left| X_{2}(\omega) \right|^{2} + \left( 1 - q \right)\left( \left| N_{2}(\omega) \right|^{2}\left| X_{1}(\omega) \right|^{2} + \left| N_{1}(\omega) \right|^{2}\left| X_{2}(\omega) \right|^{2} \right)}$

where X₁(ω) is the fast Fourier transform (FFT) of the signal from a first of the pair of audio sensors, X₂(ω) is the FFT of the signal from the second of the pair of audio sensors, |N₁(ω)|² is the noise power spectrum associated with the signal from the first sensor, |N₂(ω)|² is the noise power spectrum associated with the signal from the second sensor, and q is a proportion factor.
 19. The system of claim 18, wherein the proportion factor q is set to an estimated ratio between the energy of the reverberation and total signal at the microphones.
 20. The system of claim 18, wherein the proportion factor q ranges between 0 and 1.0, is prescribed, and is chosen to reflect an anticipated proportion of the correlated ambient noise to the reverberation noise.
 21. A system for reducing the influence from reverberation noise in audio signals prior to processing the signals, comprising: a microphone array having at least a pair of audio sensors; a general purpose computing device; a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to, input signals generated by each audio sensor of the microphone array; simultaneously sample the inputted signals to produce a sequence of consecutive blocks of the signal data from each signal, wherein each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signal sampled at the same time; for each contemporaneous pair of blocks of signal data, employ a weighting factor W_(SWITCH)(ω) to reduce reverberation noise, wherein W_(SWITCH)(ω) is a switch function such that whenever the signal-to-noise ratio (SNR) of the signal data associated with the blocks of signal data under consideration exceeds a prescribed SNR threshold, a PHAT weighting function is employed, and whenever the SNR of the signals is less than or equal to the prescribed SNR threshold, a TML weighting function is employed.
 22. The system of claim 21, wherein the prescribed SNR threshold is about 15 dB.
 23. A computer-readable medium having computer-executable instructions for estimating the time delay of arrival (TDOA) between a pair of audio sensors of a microphone array, said computer-executable instructions comprising: inputting signals generated by each audio sensor of the microphone array; simultaneously sampling the inputted signals to produce a sequence of consecutive blocks of the signal data from each signal, wherein each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signal sampled at the same time; for each contemporaneous pair of blocks of signal data, estimating the TDOA using a generalized cross-correlation (GCC) technique which, employs a provision for reducing the influence from correlated ambient noise, and employs a weighting factor for reducing the influence from reverberation noise.